A tutorial for P&C pricing using R.
Advances in machine learning and open source technologies are having a profound impact on many industries, and insurance is no exception. In the US, both the Society of Actuaries and CAS Institute list knowledge of R as a requirement in exam syllabi (Society of Actuaries 2018; CAS Institute 2018). In a draft white paper on regulatory review of predictive models, the National Association of Insurance Commisioners (NAIC) recognizes that actuaries are using techniques such as random forests and neural networks in predictive modeling projects (Casualty Actuarial and Statistical (C) Task Force 2010). However, while much has been written on modeling methodology for pricing, there have been few freely available resources that include reproducible code; one such example is (Alfredo 2011). To the best of our knowledge, prior to the current article, no tutorials exist that show the end-to-end pricing workflow.
In this tutorial, we walk through an example of a pricing project using R, from preparing raw data for modeling to filing. Throughout the process, we provide reproducible code for each analysis task so the reader can follow along. We also attempt to adhere to the latest regulatory guidelines, so actuaries can better adopt the workflow presented in their work.
In this tutorial, we use the publicly available data from Brazil’s Superintendence of Private Insurance (SUSEP). It maintains policy-characteristics-level data for personal auto from 2007 to 2014 for all insured vehicles in Brazil. This data source was chosen because it contains a variety of variables, from policyholder characteristics to losses by peril, to support a nontrivial analysis, and contains a variety of mapping tables so we can demonstrate a broader range of data manipulation work found in a typical pricing assignment. You can access the data from SUSEP’s AUTOSEG website.
(Talk about using 2013 data and validation scheme, see issues/41)
With the help of documentation accompanying the datasets, we compile a list of tables shown below.
(these need to be in better English see issues/42)
| File | Description |
|---|---|
| arq_casco | Exposure data, premiums, claims and insured amount for the CASCO overhead, classified by the Key Category Rate / Region / Model / Year / Sex / Age Range |
| arq_casco3 | Exposure data, premiums and claims for the CASCO overhang, classified by the Key Rate Category / CEP / Model / Year key |
| arq_casco4 | Exposure data, premiums and claims for the CASCO overhang, classified by the Key Rate Category / City / Model / Year |
| premreg | Regional distribution of prices |
| sinreg | Regional distribution of claims |
| File | Description |
|---|---|
| auto2_vei | FIPE code and description of each vehicle model, in addition to the group code to which it belongs |
| auto2_group | Code and description of model groups it contains |
| auto_cat | Description code of tariff categories |
| auto_cau | Code and description of causes of accidents |
| auto_cep | Correlates the CEP with cities and regions of circulation |
| auto_cob | Code and description of covers |
| auto_idade | Code and description of age groups |
| auto_reg | Code and description of regions of circulation |
| auto_sexo | Code and description of sex (male, female, legal) |
| auto_city | Code and name of cities |
The excerpts for each of the source tables, before any transformations, can be found at Raw Table Excerpts.
We note that, for the purpose of this exercise, we’ll use the arq_casco policy table since it is the only one that contains poilcyholder characteristics (sex and age.) We are unable to utilize arq_casco3 and arq_casco4, which contain more granular location information, since there are no keys to join by; presumably the data is presented this way to preserve privacy.
Because the source data is in Portuguese, the first order of business is to obtain translations of the column names so we can all understand them. TODO: dictionary
Once that’s done, we’ll take an initial look at the tables:
Skim summary statistics
n obs: 2416213
n variables: 22
Variable type: character
| variable | missing | complete | n | min | max | empty | n_unique |
|---|---|---|---|---|---|---|---|
| age_code | 0 | 2416213 | 2416213 | 1 | 1 | 0 | 6 |
| data_year | 0 | 2416213 | 2416213 | 5 | 5 | 0 | 1 |
| region_code | 0 | 2416213 | 2416213 | 2 | 2 | 0 | 43 |
| sex_code | 0 | 2416213 | 2416213 | 1 | 1 | 0 | 4 |
| vehicle_category_code | 0 | 2416213 | 2416213 | 1 | 1 | 0 | 8 |
| vehicle_code | 0 | 2416213 | 2416213 | 8 | 8 | 0 | 5364 |
Variable type: integer
| variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| vehicle_year | 12 | 2416201 | 2416213 | 2006.91 | 5.16 | 1935 | 2005 | 2008 | 2011 | 2020 | <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587><U+2586> |
Variable type: numeric
| variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| average_insured_amount | 0 | 2416213 | 2416213 | 41619.57 | 40002.72 | 0 | 20876.66 | 31596.81 | 50560.36 | 9593245.26 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_amount_collision_partial | 0 | 2416213 | 2416213 | 841.11 | 5979.58 | 0 | 0 | 0 | 0 | 2351086 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_amount_collision_total_loss | 0 | 2416213 | 2416213 | 542.3 | 6094.13 | 0 | 0 | 0 | 0 | 2300123 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_amount_fire | 0 | 2416213 | 2416213 | 16.81 | 1089.48 | 0 | 0 | 0 | 0 | 459690 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_amount_other | 0 | 2416213 | 2416213 | 233.82 | 3886.65 | 0 | 0 | 0 | 0 | 1307819 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_amount_theft | 0 | 2416213 | 2416213 | 686.06 | 8070.97 | 0 | 0 | 0 | 0 | 2056463 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_count_collision_partial | 0 | 2416213 | 2416213 | 0.2 | 4.59 | 0 | 0 | 0 | 0 | 4963 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_count_collision_total_loss | 0 | 2416213 | 2416213 | 0.016 | 0.14 | 0 | 0 | 0 | 0 | 11 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_count_fire | 0 | 2416213 | 2416213 | 0.00069 | 0.027 | 0 | 0 | 0 | 0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_count_other | 0 | 2416213 | 2416213 | 0.49 | 4.64 | 0 | 0 | 0 | 0 | 1193 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_count_theft | 0 | 2416213 | 2416213 | 0.026 | 0.26 | 0 | 0 | 0 | 0 | 82 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| EXPOSICAO2 | 0 | 2416213 | 2416213 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581> |
| exposure | 0 | 2416213 | 2416213 | 2.96 | 13.71 | 0 | 0.47 | 0.56 | 1.76 | 5420.13 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| PREMIO2 | 0 | 2416213 | 2416213 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581> |
| premium | 0 | 2416213 | 2416213 | 3375.05 | 17282.63 | 0 | 377.64 | 886.77 | 2330.22 | 1e+07 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
Skim summary statistics
n obs: 8
n variables: 2
Variable type: character
| variable | missing | complete | n | min | max | empty | n_unique |
|---|---|---|---|---|---|---|---|
| vehicle_category | 0 | 8 | 8 | 6 | 39 | 0 | 8 |
| vehicle_category_code | 0 | 8 | 8 | 1 | 1 | 0 | 8 |
Skim summary statistics
n obs: 41
n variables: 2
Variable type: character
| variable | missing | complete | n | min | max | empty | n_unique |
|---|---|---|---|---|---|---|---|
| region | 0 | 41 | 41 | 8 | 47 | 0 | 41 |
| region_code | 0 | 41 | 41 | 2 | 2 | 0 | 41 |
Skim summary statistics
n obs: 4756
n variables: 4
Variable type: character
| variable | missing | complete | n | min | max | empty | n_unique |
|---|---|---|---|---|---|---|---|
| vehicle_code | 0 | 4756 | 4756 | 8 | 8 | 0 | 4756 |
| vehicle_description | 0 | 4756 | 4756 | 6 | 58 | 0 | 4756 |
| vehicle_group | 0 | 4756 | 4756 | 3 | 33 | 0 | 476 |
| vehicle_group_code | 0 | 4756 | 4756 | 1 | 3 | 0 | 476 |
Skim summary statistics
n obs: 6
n variables: 2
Variable type: character
| variable | missing | complete | n | min | max | empty | n_unique |
|---|---|---|---|---|---|---|---|
| age_code | 0 | 6 | 6 | 1 | 1 | 0 | 6 |
| age_range | 0 | 6 | 6 | 20 | 20 | 0 | 6 |
Skim summary statistics
n obs: 5
n variables: 2
Variable type: character
| variable | missing | complete | n | min | max | empty | n_unique |
|---|---|---|---|---|---|---|---|
| peril | 0 | 5 | 5 | 6 | 19 | 0 | 5 |
| peril_code | 0 | 5 | 5 | 1 | 1 | 0 | 5 |
Skim summary statistics
n obs: 476
n variables: 2
Variable type: character
| variable | missing | complete | n | min | max | empty | n_unique |
|---|---|---|---|---|---|---|---|
| vehicle_group_code | 0 | 476 | 476 | 1 | 3 | 0 | 476 |
| vehicle_group_description | 0 | 476 | 476 | 3 | 33 | 0 | 476 |
For the policy table, with the exception of the numeric columns, all of the columns contain coded values which need to be mapped. By referencing the data documentation, we observe the data model diagram below TODO: actually do this.
(need more details here issues/44)
We then perform a series of joins and extract the relevant columns to create a combined data frame, then inspect the result for reasonableness:
Skim summary statistics
n obs: 2416213
n variables: 22
Variable type: character
| variable | missing | complete | n | min | max | empty | n_unique |
|---|---|---|---|---|---|---|---|
| age_range | 0 | 2416213 | 2416213 | 13 | 18 | 0 | 6 |
| data_year | 0 | 2416213 | 2416213 | 5 | 5 | 0 | 1 |
| region | 5329 | 2410884 | 2416213 | 8 | 47 | 0 | 41 |
| sex | 0 | 2416213 | 2416213 | 8 | 14 | 0 | 4 |
| vehicle_category | 0 | 2416213 | 2416213 | 6 | 39 | 0 | 8 |
| vehicle_description | 368687 | 2047526 | 2416213 | 6 | 58 | 0 | 4298 |
| vehicle_group | 368687 | 2047526 | 2416213 | 3 | 33 | 0 | 445 |
| vehicle_group_description | 368687 | 2047526 | 2416213 | 3 | 33 | 0 | 445 |
Variable type: integer
| variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| vehicle_year | 12 | 2416201 | 2416213 | 2006.91 | 5.16 | 1935 | 2005 | 2008 | 2011 | 2020 | <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587><U+2586> |
Variable type: numeric
| variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| average_insured_amount | 0 | 2416213 | 2416213 | 41619.57 | 40002.72 | 0 | 20876.66 | 31596.81 | 50560.36 | 9593245.26 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_amount_collision_partial | 0 | 2416213 | 2416213 | 841.11 | 5979.58 | 0 | 0 | 0 | 0 | 2351086 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_amount_collision_total_loss | 0 | 2416213 | 2416213 | 542.3 | 6094.13 | 0 | 0 | 0 | 0 | 2300123 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_amount_fire | 0 | 2416213 | 2416213 | 16.81 | 1089.48 | 0 | 0 | 0 | 0 | 459690 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_amount_other | 0 | 2416213 | 2416213 | 233.82 | 3886.65 | 0 | 0 | 0 | 0 | 1307819 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_amount_theft | 0 | 2416213 | 2416213 | 686.06 | 8070.97 | 0 | 0 | 0 | 0 | 2056463 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_count_collision_partial | 0 | 2416213 | 2416213 | 0.2 | 4.59 | 0 | 0 | 0 | 0 | 4963 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_count_collision_total_loss | 0 | 2416213 | 2416213 | 0.016 | 0.14 | 0 | 0 | 0 | 0 | 11 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_count_fire | 0 | 2416213 | 2416213 | 0.00069 | 0.027 | 0 | 0 | 0 | 0 | 3 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_count_other | 0 | 2416213 | 2416213 | 0.49 | 4.64 | 0 | 0 | 0 | 0 | 1193 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| claim_count_theft | 0 | 2416213 | 2416213 | 0.026 | 0.26 | 0 | 0 | 0 | 0 | 82 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| exposure | 0 | 2416213 | 2416213 | 2.96 | 13.71 | 0 | 0.47 | 0.56 | 1.76 | 5420.13 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
| premium | 0 | 2416213 | 2416213 | 3375.05 | 17282.63 | 0 | 377.64 | 886.77 | 2330.22 | 1e+07 | <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> |
(talk about missing values)
Distributions “One-way” analysis
(something like this, will need to add tooltip and cut out Brazil)